Speeding up the blitter 

by Chris Green 


The new modes provided in the AA chipset have given the Amiga much higher 
resolution and more colors. Unfortunately, since the Blitter chip used to manipulate 
these graphics displays has remained unchanged since 1985, screen-rendering speed 
(always one of the Amiga’s distinctive features) has become a serious bottleneck 
in the new modes. In the extreme case (640x480x256 colors), screen scroll speed 
makes the display virtually unusable. Scrolling the screen up one line takes multiple seconds. 
Granted, this is an extreme case, but the lack of a faster blitter has significant effects 
on more reasonable resolutions as well: 

1. A reasonable Workbench set-up for a 4x AA machine is 640x480x16 colors (or 
640x400x16) at 31 kHz. This mode has the same degree of display contention as 
the previous best mode (640x200x4 colors). Unfortunately, because there are twice 
as many bytes to be moved, this mode will seem significantly slower. 

2. Games typically run in 320x200x16 or 320x200x32 colors. Without a faster blitter, 
AA games which want to support 256 colors will be forced to either (A) run slower, 
(B) use fewer shapes, or (C) use smaller shapes. None of these options will help 
Amiga games compete with the largely sprite-based Super Nintendo and Sega 
systems or with cheap, fast PCs with VGA. 

3. Productivity software will suffer. Our user interface is designed with fast graphics in 
mind. For instance, if it is not possible to drag a large 256 color brush around 
on a high resolution screen, new AA paint programs will suffer greatly. 


Possible solutions: 


The most obvious solution is to improve the blitter speed. By making the blitter 
take advantage of double-CAS cycles, it could be sped up by a factor of two. An 
additional factor of two could be obtained by taking advantage of the 32 bit width of 
the data bus. 

If properly done, this would require no software changes, and would transparently 
and automatically speed up the OS graphics routines, and all application software. This 
is the best solution. 

However, if the best solution is not possible due to the constraints on design-time, 
chip size, cycle time, etc., there are other solutions which could significantly improve 
rendering speed without changing the basic blitter memory access architecture. These 
solutions could also provide extra performance even if the 2/4X speed solution were 
implemented. 


My proposal is that 1 new blitter register be added, and that 2 of the unused bits in 
the BLTCON1 register be redefined for control of the new features. 

The proposed new register is BLTA2LWM. The new BLTCON1 bits are 
BMASKEN and MIDCDIS: 

BMASKEN: 

This new bit in BLTCON1 extends the first and last word masking capability which has 
always been available for the A source to the B source. When BMASKEN is set, the 
first and last (and second-to-last) word masks will be applied to the B source instead of 
to the A source. To see why this is useful, you must understand the way the blitter is 
programmed for a typical scroll or fill operation. 

For instance, to scroll an arbitrarily aligned rectangle of a screen window on 
the screen, the D source would be set to point to the destination. The C source would 
be set to the same address so that the first and last words of the blit can be properly 
masked. The B source would point at the source data to be copied. The A source would 
be disabled, its BLTADAT register set to $FFFF, and the first and last word masks set 
to properly mask the left and right edges. 

This results in a D+C+B blit, which takes 8 ticks per word copied. 

However, disabling the A source (one of the "free" sources) and using it only as a 
mask is wasteful. If we could use B for the mask and A for the source data, then this 
would be a D+C+A blit, which only takes 6 ticks per word copied, a 25% improvement. 

Alternatively, B+C+D blits could be made as fast as A+C+D blits and B+D blits 
made as fast as A+D blits. 
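
The conventional scroll set-up described above can be sketched as a small software model of one row of the blit. This is only an illustration: the function name `edge_masked_copy` is hypothetical, shifting is ignored, and the row is modelled as a list of 16-bit words. A is never fetched; it holds $FFFF and is trimmed only by the first and last word masks, so it selects between the new data (B) and the old destination (C).

```python
# Software model of the conventional scroll blit on one row of 16-bit words.
# A is not fetched: BLTADAT holds $FFFF and only the first/last word masks
# alter it. B is the source data, C the old destination, D the result.

def edge_masked_copy(src_row, dst_row, first_mask, last_mask):
    """Return the new destination row: D = (A AND B) OR (NOT A AND C)."""
    if len(src_row) != len(dst_row):
        raise ValueError("source and destination rows must be the same width")
    last = len(src_row) - 1
    out = []
    for i, (b, c) in enumerate(zip(src_row, dst_row)):
        a = 0xFFFF                  # constant BLTADAT value
        if i == 0:
            a &= first_mask         # first word mask trims the left edge
        if i == last:
            a &= last_mask          # last word mask trims the right edge
        out.append((a & b) | (~a & c & 0xFFFF))
    return out
```

Note that for a one-word row both masks land on the same word, so the word is trimmed on both sides. With BMASKEN as proposed, the roles of A and B in this model would simply be swapped: the masks would trim a constant B, and A would carry the fetched source data.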


CFOLLOWD: 

Note that in the scroll above, many wasted memory cycles are occurring. The C 
source has been enabled for the entire blit, even though it is only used to properly mask 
the left and right edges of the rectangle being scrolled. One approach would be to do 
the blit as three separate operations, one for each edge, and one for the middle words. 
Unfortunately, this is visually objectionable, especially on slower scrolls (which are the 
ones we want to speed up!). Therefore, I propose a new bit in BLTCON1 which will 
cause the blitter to disable the fetch of the C source for all of the middle words of the 
blit. On the left and right edges, the C source will be fetched using D's pointer (thus the 
C pointer will not have to be loaded). This will allow a D=A blit to happen for all but the 
end words. A D=A blit is the fastest one, taking 4 ticks per word copied. Thus the 
combination of B masking and CFOLLOWD results in a 2X improvement in scroll 
speed. 

C will be fetched once on each line for a one word blit, and twice per line for a blit of 2 or more words. 

This capability will approximately double the speed of: 

Solid fills (rectangles) 

Scrolling 

Window Movement, Closing, Depth Arrangement, etc. 

Text 
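
The tick counts quoted in the BMASKEN and CFOLLOWD sections can be combined into a rough per-line cost model. The sketch below is not a timing specification: the function and scheme names are hypothetical, and it assumes that an edge word (where C is still fetched) costs the same 6 ticks as a D+C+A word, which the text does not state explicitly.

```python
# Rough per-line cost model built from the tick counts given in the text.
TICKS_DCB = 8   # D+C+B blit: today's scroll set-up
TICKS_DCA = 6   # D+C+A blit: possible once BMASKEN moves the masks to B
TICKS_DA = 4    # D=A blit: middle words once CFOLLOWD drops the C fetch

def ticks_per_line(width_words, scheme):
    """Estimate blitter ticks for one line of a scroll blit."""
    if width_words < 1:
        raise ValueError("a blit is at least one word wide")
    if scheme == "current":
        return TICKS_DCB * width_words
    if scheme == "bmasken":
        return TICKS_DCA * width_words
    if scheme == "bmasken+cfollowd":
        # C is fetched once for a one word blit, twice for anything wider.
        edge_words = 1 if width_words == 1 else 2
        # Assumption: an edge word costs as much as a D+C+A word.
        return TICKS_DCA * edge_words + TICKS_DA * (width_words - edge_words)
    raise ValueError("unknown scheme: " + scheme)
```

For a 40 word (640 pixel) line this model gives 320, 240 and 164 ticks respectively, so the combined scheme comes out at a little under twice the speed of the current one, and the ratio approaches 2 as the blit gets wider.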


BLT2AFWM: 

Now, let’s examine the problem of "cookie-cut blits". A cookie-cut blit is used 
when a program wants to move an irregular (non-rectangular) shape around on the 
screen. This is very common, and is used for shapes in games, brushes in paint 
programs, icons on the Workbench, etc. 

The shape is represented by a normal set of bitplane data, with one additional 
bitplane added, which serves as an enabling mask. Source pixels which have a ’1’ in 
this enabling mask will be copied to the screen, while pixels with a ’0’ will leave the 
destination unmodified. 
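
As a minimal sketch of the rule just described, the following models a cookie-cut blit in software. The function name `cookie_cut` and the data layout (each bitplane as a list of 16-bit words, all the same length) are assumptions made for illustration, and no shifting is performed. The single mask plane is reused for every bitplane of the shape.

```python
def cookie_cut(shape_planes, mask_plane, screen_planes):
    """Apply one mask plane to every bitplane of a shape.

    Where a mask bit is 1 the shape pixel is copied; where it is 0 the
    screen pixel is left unmodified.
    """
    result = []
    for shape, screen in zip(shape_planes, screen_planes):
        result.append([
            (m & s) | (~m & d & 0xFFFF)
            for m, s, d in zip(mask_plane, shape, screen)
        ])
    return result
```

Each bitplane needs its own pass over the mask, the shape data and the old screen contents, which is why the per-word cost of a three-source blit matters so much for deep displays.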

In order for this blit to be used with arbitrarily aligned source data being blitted to 
arbitrarily aligned destinations, shifting must be done by the blitter. The mask planes 
must be shifted as well as the source data. 

Unfortunately, shifting requires that one extra word of source be fetched per 
scanline. This extra word is ignored by setting the last-word mask register to zero, which 
prevents that data from participating in the blit. However, using the mask for this 
purpose prevents it from being used to mask the rightmost pixels of the source data 
(remember that the size of the blit must be capable of being an arbitrary number of 
pixels wide). What is needed to prevent this is a mask for the second-to-last word, 
which would be set to a non-zero value depending upon the width of the source data. 
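
To make the problem concrete, here is a small software model of one masked and shifted source row. It is a sketch under stated assumptions: the function name and the `second_last_mask` parameter are hypothetical (the parameter stands in for the proposed new mask register), masks are applied to each fetched word before it enters the shifter, and the new mask only takes effect on rows of 3 words or more.

```python
def shift_masked_row(data, shift, first_mask, last_mask, second_last_mask=0xFFFF):
    """Mask, then shift right, one row of 16-bit source words.

    `data` includes the one extra word that must be fetched for shifting.
    Bits shifted out of one word enter the top of the next.
    """
    if not 0 <= shift <= 15:
        raise ValueError("shift must be in the range 0..15")
    n = len(data)
    out = []
    prev = 0
    for i, word in enumerate(data):
        if i == 0:
            word &= first_mask
        if i == n - 1:
            word &= last_mask           # zero here discards the extra word
        if n >= 3 and i == n - 2:
            word &= second_last_mask    # proposed: trims the true right edge
        out.append((((prev << 16) | word) >> shift) & 0xFFFF)
        prev = word
    return out
```

Take a 20 pixel wide shape shifted right by 4, stored in two words with stray bits after pixel 20, plus the extra word: `[0xFFFF, 0xFFFF, 0xFFFF]`. With the last word mask at zero and no second-to-last mask, the model yields `[0x0FFF, 0xFFFF, 0xF000]`, so 12 stray pixels leak through. Setting `second_last_mask` to `0xF000` yields `[0x0FFF, 0xFF00, 0x0000]`, exactly 20 pixels.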

The solution currently used in the OS for this is to do two entire blits. The first blit 
zeroes out the destination data via the mask, and the second blit then ORs in the source 
data. This both looks bad and runs slowly. Both blits are full 3 channel blits. The 
addition of a second-to-last-word mask would add the capability of doing this in only one blit, 
and also fix the visual problem. This would result in a >2X speedup. 

This problem can also occur with non-masked blits when copying arbitrary, 
possibly overlapping rectangles. In this case, we create one scan-line of temporary 
mask data and point a source at it. Having the new register would both eliminate the 
requirement for the temporary buffer, and remove the need to waste cycles fetching 
this buffer. This new register will require either an enable bit in BLTCON1 or an interlock 
in order to preserve software compatibility. I suggest using one of the unused bits in 
BLTCON1. 

This new capability will vastly speed up: 

Icon rendering 

Game animation rendering 

Text 

The new mask register will apply only to blits of 3 words wide or greater. 